This paper presents a convolutional neural network (CNN) based non-invasive pathological voice
detection algorithm using a signal processing approach. The proposed algorithm extracts an acoustic
feature, called the chromagram, from voice samples and applies this feature to the input of a CNN for
classification. The main advantage of the chromagram is that it can mimic the way humans perceive pitch
in sounds and hence can be considered useful to detect dysphonic voices, as the pitch in the generated

1. INTRODUCTION 

Voice disability is a barrier to effective human speech communication. Primarily, voice disability 
occurs due to improper function of the components that constitute the human voice generation system 
[1]-[3]. Various voice disabilities have been reported in the literature. The American Speech-Language-Hearing
Association (ASHA) has identified dysphonia [4] as the most common voice pathology. Dysphonia
refers to abnormal voices [5], [6] that can develop suddenly or gradually over time. Commonly, dysphonia 
is caused by inflamed vocal folds that cannot vibrate properly to produce normal voice sounds [7], [8]. 

Both invasive and non-invasive methods are used for pathological voice detection. In invasive 
methods, physicians insert a probe into the patient’s mouth using endoscopic procedures, namely 
laryngoscopy [9], stroboscopy [10], and laryngeal electromyography [11]. In non-invasive methods, audio 
signals are used for voice pathology detection. Audio signals including cough, breathing, and voice have 
been widely used in many applications including telecommunication, robot control [12], domestic 
appliance control, data entry, voice recognition, disease diagnosis [13], [14], and speech-to-text conversion 
[15], [16]. Voice signals are mainly used to implement non-invasive voice pathology detection algorithms 
[17]. The objectives of these methods are twofold: i) to reduce the discomfort of the patient and ii) to assist 
clinicians in the preliminary diagnosis of voice pathology. In these methods, voice samples are collected in a 
controlled environment and acoustic features are extracted from them. The samples are then classified 
into two categories, namely normal (i.e., healthy) and pathological, using a classifier 
algorithm. Numerous classifier algorithms have been suggested in the literature for pathological voice 
detection. Recently, machine learning and deep learning algorithms have drawn considerable attention from 
researchers [18]-[22]. Specifically, deep learning algorithms demonstrate promising results and provide high 
accuracy in voice pathology detection [19]-[22]. 

Various voice disability detection algorithms have been published in the literature. In [23], the 
authors have used a spectrogram to detect voice pathology. They have suggested minimizing the effects of 
jitter, shimmer, and harmonic-to-noise ratio (HNR) to improve the classification accuracy with the 
spectrogram. In [24], eight voice pathologies have been investigated, and the results show that a deep neural 
network (DNN) based classifier achieves a high accuracy (94.26% and 90.52% for male and female subjects, 
respectively). 

Vocal disorders, namely neoplasm, phono-trauma, and vocal palsy have been investigated in [25]. 
The authors used a dense net recurrent neural network (DNRNN) in their work. The results show that the 
DNRNN algorithm achieves an accuracy of 71%. In another similar work [26], multiple neural networks 
have been used to detect voice pathology. The authors have used a multilayer perceptron neural network 
(MLPNN), a general regression neural network (GRNN), and a probabilistic neural network (PNN) in this work, 
and they achieved the highest accuracy with the MLPNN. Some researchers suggested using multiple voice 
features to increase accuracy. For example, the researchers in [27] have used six voice features, namely jitter, 
shimmer, HNR, soft phonation index (SPI), amplitude perturbation quotient (APQ), 
and relative average perturbation (RAP). They have achieved classification accuracies of 95.20% and 
84.20% with a Gaussian mixture model (GMM) and an artificial neural network (ANN), respectively. 

Support vector machine (SVM) and radial basis function neural network (RBFNN) classifiers have been 
used in [28] to detect voice pathology. The authors have also used several features, and they 
achieved an accuracy of 91% with the RBFNN and 83% with the SVM. 
Stuttered speech has been addressed in [29]. The authors developed a classifier that detects stuttered 
voices with an accuracy of 85% and 78% for males and females, respectively. Four voice attributes, namely 
roughness, breathiness, asthenia, and strain, have been considered in [30]. The proposed algorithm used these 
attributes for classification with a feed-forward neural network (FFNN) and achieved an 
average F-measure of 87.25%. 

Some researchers claim that the spectrogram is the most suitable voice feature to detect voice pathology 
because it traces different frequencies and their occurrences in time. For example, the authors of [31] have used 
the spectrogram to detect pathological voice disorder due to vocal cord paralysis. The spectrograms of 
pathological and normal speech samples are applied to the input of a convolutional deep belief network 
(CDBN). The authors have achieved 77% and 71% accuracy for the CNN and the CDBN, respectively. 

Dysphonic voice detection using a pre-trained CNN has been presented in [32]. The results show 
that the proposed method can detect dysphonic voices with an accuracy of 95.41%. To detect dysphonic 
voice, a new marker called the dysphonic marker index (DMI) has been introduced in [33]. The marker 
consists of four acoustic parameters. The authors have employed a regression algorithm that uses this marker 
to discriminate pathological voices from healthy ones. A novel computer-aided pathological voice 
classification system is proposed in [34]. In this work, the authors have used a deep-connected ResNet for 
classification. 

There are two significant limitations of the above-mentioned related works. One limitation is that 
none of the works considers the way the human auditory system perceives the pitch contained in sounds. 
Another limitation is that these works use classifiers that impose a substantial computational burden on 
the system. This work overcomes both limitations by using the chromagram 
feature and a CNN. The chromagram has been commonly used to detect musical sounds; to the best of our 
knowledge, this is the first work to use the chromagram in pathological voice detection. The main contributions 
of this work are: i) developing a novel pathological voice detection algorithm based on signal processing and a deep 
learning approach, ii) extracting the chromagram feature from the voice samples and converting it into a 
suitable form for classification, iii) achieving a high classification accuracy without imposing a large 
computational burden on the system, iv) providing a detailed performance analysis of the proposed system in 
terms of classification matrix, precision, sensitivity, specificity, and F1-score, and v) comparing the 
performance of the proposed algorithm with other related works to demonstrate its effectiveness. The rest of 
the paper is organized as follows: materials and methods are presented in section 2, the results are presented in 
section 3, and the paper is concluded in section 4. 


2. MATERIALS AND METHODS 

This investigation uses normal and dysphonic voice samples collected from the Saarbrucken 
voice database (SVD) [35]. The SVD contains 687 normal (i.e., healthy) samples and 1,356 
pathological samples. The samples contain recordings of vowels and sentences. In this investigation, the 
sustained phonation of the vowel ‘/a/’ is used. The main reason is that a speaker can maintain a steady 
frequency and amplitude at a comfortable level while producing the vowel ‘/a/’. Moreover, it is 
free of articulatory and other linguistic confounds that often exist in other common speech components, 
including sentences and running speech. Figure 1 shows the plots of a normal voice sample and a 
dysphonic voice sample randomly selected from the SVD. The figure shows that 
the dysphonic voice sample exhibits irregular distortion in both magnitude and shape compared 
to the healthy sample. 

Figure 1. The normal and dysphonic voice samples 


In this work, the chromagram audio feature is used for classification. The 
chromagram is derived from the principle of human auditory perception of pitch in sound. The pitch 
representation is shown in Figure 2. Generally, humans perceive pitch as extending from 
low to high along a circular scale traversed in the clockwise direction, as shown in Figure 2(a). In addition, 
pitch has a linear scale. To accommodate both the linear and circular dimensions, Shepard introduced the 
concept of chroma in [36]. He suggested that the human auditory system’s pitch perception is better 
represented as a helix, as shown in Figure 2(b). According to Shepard, the pitch p perceived by a human can 
be expressed in terms of chroma c and tone height h by (1). 


$$p = 2^{h+c} \tag{1}$$

Patterson generalized Shepard’s model in [37] to find the pitch by using (2). 

$$f = 2^{h+c} \tag{2}$$
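As a small worked illustration, suppose the relation in (2) has the form f = 2^(c+h), with chroma c in [0, 1) and an integer tone height h. Under that assumption a frequency can be split into its two components by taking the base-2 logarithm. The function name `chroma_and_height` is hypothetical and is not part of the original work.

```python
import math


def chroma_and_height(freq_hz):
    """Split a frequency into (chroma, tone height), assuming f = 2 ** (c + h).

    The tone height h is the integer part of log2(f) and the chroma c is the
    fractional part, so c always lies in [0, 1).
    """
    log_f = math.log2(freq_hz)
    height = math.floor(log_f)
    chroma = log_f - height
    return chroma, height


# Two tones one octave apart: the chroma is shared, the tone height differs by one.
c_low, h_low = chroma_and_height(220.0)
c_high, h_high = chroma_and_height(440.0)
print(round(c_low, 4), h_low)
print(round(c_high, 4), h_high)
```

The example shows the property that motivates the helix model: doubling the frequency moves one turn up the helix (h increases by one) while the position on the chroma circle stays the same.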


The chroma spectrum, S(c), is defined as the measure of the strength of a signal with a given value of 
chroma. This is analogous to the standard Fourier power spectrum of a signal. Then, the chroma spectrum is 
extended in the time domain to create a time-frequency distribution (TFD) S(c, t). This distribution is called 
the chromagram and it represents a joint distribution of signal strength over the variables of chroma and time. 

The chromagram is generated in two major steps in this work, namely frame segmentation and 
feature calculation. Rather than using a uniform frame size, a beat-synchronous frame segmentation is used 
in this work. This allows the frame sampling to track the rhythm of the voice. The second step of the 
algorithm is the feature calculation. A 12-element representation of the spectral energy, called the chroma 
vector, is used. The chroma vector is computed by grouping the discrete Fourier transform 
(DFT) coefficients of a short-term window into 12 bins. Each bin represents one of the 12 equal-tempered pitch 
classes of semitone spacing. Each bin produces the mean of the log-magnitudes of the respective DFT coefficients, 
as defined by (3). 


$$v_k = \frac{1}{N_k} \sum_{n \in S_k} X_i(n), \quad k \in 0, \ldots, 11 \tag{3}$$

where $X_i(n)$ is the log-magnitude of the $n$-th DFT coefficient of frame $i$, $S_k$ is a subset of the 
frequencies that correspond to the DFT coefficients, and $N_k$ is the cardinality 
of $S_k$. This computation results in a matrix $V$ with elements $V_{k,i}$, where indices $k$ and $i$ represent 
pitch class and frame number, respectively. The matrix $V$ is represented in a suitable form to produce the 
chromagram. The detailed algorithm for computing the chroma vector is shown in Algorithm 1. The 
chromagrams of normal and dysphonic voices are plotted in Figure 3. This figure shows distinct differences 
between the chromagrams of the normal and dysphonic voices. For example, the chromagram of the normal 
voice shows a few dominant coefficients, which are stable for a short period. However, the chromagram of 
the dysphonic voice is noisier than that of the normal voice. 

Figure 2. The pitch representation (a) circular configuration of pitch and (b) Shepard’s helix model [37] 


Algorithm 1. Generating the chromagram of voice signals 
/* Set the initial values of the parameters */ 
Set the number of tone heights h = 12; 
Set the number of bins B = 12; 
Set the fundamental frequency f_0 = 55 Hz; 
Load the sound data into vector X; 
Read the sampling rate from the sound file and store it into variable F_s; 
Normalize the data to obtain X_norm; 
Set window size W; 
Calculate the number of frames N = length(X) / W; 
/* Set the frequency range to compute the spectrogram */ 
Set minimum frequency f_min = 0; 
Calculate the maximum frequency f_max = F_s / 2; 
Calculate the chromatic scale f_i = f_0 * 2^(i/12); 
Compute the discrete Fourier transform X(k) of the signal vector; 
/* Determine the log-magnitudes of the respective DFT coefficients */ 
while i < N 
    while k < B 
        v_k = (1 / N_k) * sum of X_i(n) over n in S_k; 
    end while 
    Convert v_k into V_{k,i}; 
end while 
Generate chromagram C(m,t) 
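The steps of Algorithm 1 can be sketched in Python with NumPy as follows. This is a minimal illustration under several assumptions that are not taken from the original work: fixed-size frames are used instead of beat-synchronous segmentation, the data are peak-normalized, `log1p` is used for the log-magnitude to avoid taking the logarithm of zero, and DFT bins below f_0 = 55 Hz are ignored. The function names `chroma_vector` and `chromagram` are hypothetical.

```python
import numpy as np


def chroma_vector(frame, sample_rate, f0=55.0, n_bins=12):
    """Return a 12-element chroma vector for one frame.

    Each DFT bin is assigned to a pitch class through the equal-tempered
    scale f = f0 * 2 ** (i / 12); the log-magnitudes in a class are averaged.
    """
    magnitude = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    valid = freqs >= f0                        # assumption: skip bins below f0
    log_mag = np.log1p(magnitude[valid])       # assumption: log(1 + |X|)
    semitone = np.round(n_bins * np.log2(freqs[valid] / f0)).astype(int)
    pitch_class = semitone % n_bins
    chroma = np.zeros(n_bins)
    for k in range(n_bins):
        members = log_mag[pitch_class == k]    # coefficients in the subset S_k
        if members.size:                       # members.size plays the role of N_k
            chroma[k] = members.mean()
    return chroma


def chromagram(signal, sample_rate, window=2048):
    """Stack the chroma vectors of consecutive frames into a 12 x N matrix."""
    signal = signal / np.max(np.abs(signal))   # assumption: peak normalization
    n_frames = len(signal) // window
    frames = signal[: n_frames * window].reshape(n_frames, window)
    return np.stack([chroma_vector(f, sample_rate) for f in frames], axis=1)


# Synthetic check: a 440 Hz tone, three octaves above f0 = 55 Hz.
# window=2000 is chosen so that 440 Hz falls exactly on a DFT bin.
sr = 16000
t = np.arange(sr) / sr
V = chromagram(np.sin(2 * np.pi * 440.0 * t), sr, window=2000)
print(V.shape)
print(int(V[:, 0].argmax()))
```

For the synthetic tone, the energy concentrates in the pitch class shared with f_0, which is the behaviour the chroma representation is meant to capture. A real voice recording would be loaded from an audio file in place of the synthetic signal.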


In this work, a CNN is employed as the binary classifier. The CNN model presented in [38] is used 
as the base to implement the classifier. The CNN model includes two networks, namely a feature extraction 
network and a classifier network. The input data (i.e., the chromagram) is applied to the feature extraction 
network, and the extracted feature map is applied to the classifier network. The feature extraction 
network consists of stacks of convolutional layer and pooling layer pairs, as shown in Figure 4. To limit the 
computational burden on the system, one convolutional layer is used as the feature extraction network in this 
work. The CNN model uses 20 convolutional filters of size 9 × 9. The feature map produced by the 
convolutional filters is processed by the rectified linear unit (ReLU) activation function. 
The output of the convolutional layer is then passed through the pooling layer. In 
this work, 2 × 2 mean pooling is used. The hidden layer has 
100 nodes that also use the ReLU activation function. The output layer of the CNN contains a single node, as 
the decision made by the classifier is binary, and the SoftMax function is used as the activation function at the 
output node. 

Figure 3. The chromagram of normal voice and dysphonic voice 

Figure 4. The architecture of the convolutional neural network 
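To make the size of the described network concrete, the following sketch traces the feature-map dimensions and parameter counts for one convolutional layer (20 filters of size 9 × 9), 2 × 2 pooling, a 100-node hidden layer, and a single output node. It assumes a single-channel input, a stride-1 convolution without padding, and non-overlapping pooling. None of these details, nor the input size, are stated in the text, so the 28 × 28 input and the function name `cnn_shapes` are purely hypothetical.

```python
def cnn_shapes(height, width, n_filters=20, kernel=9, pool=2, hidden=100):
    """Trace feature-map sizes and parameter counts for the described CNN.

    Assumes one input channel, an unpadded stride-1 convolution, and
    non-overlapping pooling (all assumptions, not taken from the paper).
    """
    conv_h, conv_w = height - kernel + 1, width - kernel + 1
    pool_h, pool_w = conv_h // pool, conv_w // pool
    flat = pool_h * pool_w * n_filters          # inputs to the hidden layer
    params = {
        "conv": n_filters * (kernel * kernel + 1),   # weights plus one bias per filter
        "hidden": flat * hidden + hidden,
        "output": hidden + 1,
    }
    return (conv_h, conv_w), (pool_h, pool_w), flat, params


# Hypothetical 28 x 28 chromagram image.
conv, pooled, flat, params = cnn_shapes(28, 28)
print(conv, pooled, flat)
print(params, sum(params.values()))
```

Under these assumptions almost all parameters sit in the fully connected hidden layer, so the input size, rather than the single convolutional layer, dominates the computational cost.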


3. RESULTS AND DISCUSSION 

To measure the performance of the proposed algorithm, four machine learning performance 
parameters, namely accuracy, precision, recall, and F1 score [39], [40], are used. Ten simulations were 
conducted using the chromagrams of normal and pathological samples. First, the CNN is trained with the 
chromagrams of 50 normal samples and 50 pathological samples. A five-fold cross-validation method is used 
to validate the training. Once trained, the chromagrams of 50 other normal and pathological 
samples are used to test the network’s performance. The training, validation, and 
testing results of the proposed algorithm are listed in Table 1. This table shows that the proposed system 
achieves an average accuracy of 98.33%, 94.11%, and 84.99% for training, validation, and testing, respectively. 
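The five-fold cross-validation step can be illustrated with a short sketch. It assumes that the 100 training samples are split into five class-balanced folds, so that each fold validates on 20 samples and trains on the remaining 80. Whether the original experiments balanced the folds in this way is not stated, and the function name `stratified_folds` is hypothetical.

```python
import random


def stratified_folds(n_normal=50, n_pathological=50, k=5, seed=0):
    """Yield (train_indices, validation_indices) for k-fold cross-validation.

    Indices 0..n_normal-1 stand for normal samples and the remaining indices
    for pathological samples; every fold keeps the two classes balanced.
    """
    rng = random.Random(seed)
    normal = list(range(n_normal))
    pathological = list(range(n_normal, n_normal + n_pathological))
    rng.shuffle(normal)
    rng.shuffle(pathological)
    for fold in range(k):
        validation = normal[fold::k] + pathological[fold::k]
        held_out = set(validation)
        train = [i for i in normal + pathological if i not in held_out]
        yield train, validation


for train_idx, val_idx in stratified_folds():
    print(len(train_idx), len(val_idx))
```

Each sample is used for validation exactly once across the five folds, and the validation accuracy is typically reported as the average over the folds.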

The corresponding classification matrix is shown in Table 2. Based on the data presented in Table 2, 
it is observed that the proposed system can correctly detect pathological voices with an accuracy of 90.00%. 
On the other hand, the system can detect normal voices with an accuracy of 80.00%. This 
shows that the system detects dysphonic voices somewhat more reliably than normal voices. The classification 
matrix also shows that the proposed system misclassifies pathological voices as normal with 
a probability of only 10.00% and misclassifies healthy samples as pathological 
with a probability of 20.00%. The performance measures of the proposed algorithm are listed in Table 3. This 
table shows that the proposed algorithm achieves accuracy, precision, recall, and F1 score of 85.00%, 
81.80%, 90.00%, and 85.70%, respectively. 

A novel convolutional neural network based dysphonic voice detection algorithm ... (Rumana Islam) 
ISSN: 2088-8708 

Finally, the performance of the proposed algorithm is compared with other related published 
works in Table 4. As listed in the table, the spectrogram audio feature 
has been used in [25] and [31], and the authors have achieved an accuracy of 71% in both works. Table 4 
also shows that the proposed system achieved an accuracy of 85% with the chromagram, which is 
higher than that achieved in [25], [31]. Moreover, this accuracy is close to those of some other 
related works presented in [24], [27], [28], [30]. The comparison in Table 4 also suggests that the 
chromagram is a useful audio feature to detect voice pathology provided a suitable classifier like the 
CNN is used. Comparing the results of the proposed system with those of multiple-feature-based 
systems, the results suggest that a single feature like the chromagram can be sufficient to detect pathological 
voices with a high accuracy, and hence multiple acoustic features can be avoided to reduce the computational 
burden of the pathological voice detection system. 


Table 1. Training and testing accuracies for the chromagram 

Simulation No.    Accuracy (%) 
                  Training    Validation    Testing 
1                 95.83       92.00         79.16 
2                 95.83       90.00         83.33 
3                 100.00      93.45         87.50 
4                 100.00      92.33         87.50 
5                 100.00      96.67         87.50 
6                 95.83       92.33         79.16 
7                 100.00      98.00         87.50 
8                 100.00      96.00         83.33 
9                 95.83       92.33         87.50 
10                100.00      98.00         87.50 
Average           98.33       94.11         84.99 


Table 2. The classification matrix for the chromagram 

                  Prediction (%) 
Actual            Normal      Pathology 
Normal            80.00       20.00 
Pathology         10.00       90.00 


Table 3. The performance measures for the chromagram 

Performance measure     Chromagram (%) 
Accuracy                85.00 
Precision               81.80 
Recall/Sensitivity      90.00 
F1 Score                85.70 
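The values in Table 3 follow directly from the classification matrix in Table 2. The short sketch below recomputes them, assuming that pathology is the positive class and that the two classes contain the same number of test samples, so that the row percentages can be used as counts. The function name `binary_metrics` is hypothetical.

```python
def binary_metrics(tp, fn, fp, tn):
    """Return accuracy, precision, recall, and F1 score from confusion counts."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Table 2 with pathology as the positive class:
# 90% of pathological voices detected (tp), 10% missed (fn),
# 20% of normal voices flagged as pathological (fp), 80% recognized (tn).
acc, prec, rec, f1 = binary_metrics(tp=90.0, fn=10.0, fp=20.0, tn=80.0)
print(f"accuracy={acc:.4f} precision={prec:.4f} recall={rec:.4f} f1={f1:.4f}")
```

The recomputed precision (about 0.818) and F1 score (about 0.857) agree with the 81.80% and 85.70% reported in Table 3 up to rounding.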


Table 4. The comparison of the performances 

Research works | Phonemes | Features | Tools | Accuracy 
Wu et al. [31] | Vowels, speech | Spectrogram | CNN, CDBN | 71% 
Fang et al. [24] | Vowels | Mel frequency cepstral coefficients (MFCC) | SVM, GMM, DNN | 94.26% (male), 90.52% (female) 
Jun and Kim [25] | General voice samples | Mel spectrogram | DNRNN | 71% 
Wang and Cheolwoo [27] | Sustained vowel ‘/a/’ | Feature vectors | Hidden Markov model (HMM), GMM, and SVM | 95.2% 
Sellam and Jagadeesan [28] | Tamil phrases | Feature vectors | SVM and RBFNN | 91% (RBFNN), 83% (SVM) 
Sassou [30] | Japanese vowel | Higher-order local auto-correlation (HLAC) | FFNN and autoregressive hidden Markov model (AR-HMM) | 87.75% 
Proposed method | Sustained vowel ‘/a/’ | Chromagram | CNN | 85.00% 


4. CONCLUSION 

This paper presents a CNN-based non-invasive pathological voice detection algorithm using the 
chromagram feature of voice samples. The simulation results suggest that the chromagram is a useful audio 
feature for pathological voice detection algorithms, although it has so far attracted the 
attention of researchers mainly in music detection. Dysphonia causes a change in the pitch of the sound and 
hence affects the chromagram more than other features available in the literature. Hence, a 
higher accuracy is achieved with the chromagram compared to other single features. The 
accuracy, precision, recall, and F1 score also confirm the effectiveness of 
the proposed algorithm. The performance of the proposed algorithm has been compared with that of other 
related works available in the literature. This comparison shows that a pathological voice detection system 
implemented with the chromagram voice feature and the CNN can outperform some other existing systems. 
The proposed algorithm only discriminates dysphonic voices from normal 
voices. Determining the progression level of voice pathology is another challenging task that 
has not been addressed in this work. Other popular machine learning algorithms such as SVM, 
k-nearest neighbors (KNN), and GMM also need to be investigated in the future to establish the comparative 
effectiveness of the proposed algorithm. 